Abstract: Monaural source separation is important for many 
real world applications. It is challenging because, with only a 
single channel of information available, without any constraints, 
an infinite number of solutions are possible. In this paper, 
we explore joint optimization of masking functions and deep 
recurrent neural networks for monaural source separation tasks, 
including monaural speech separation, monaural singing voice 
separation, and speech denoising. The joint optimization of the 
deep recurrent neural networks with an extra masking layer 
enforces a reconstruction constraint. Moreover, we explore a 
discriminative criterion for training neural networks to further 
enhance the separation performance. We evaluate the proposed 
system on the TSP, MIR-1K, and TIMIT datasets for speech 
separation, singing voice separation, and speech denoising tasks, 
respectively. Our approaches achieve 2.30-4.98 dB SDR gain 
compared to NMF models in the speech separation task, 2.30- 
2.48 dB GNSDR gain and 4.32-5.42 dB GSIR gain compared 
to existing models in the singing voice separation task, and 
outperform NMF and DNN baselines in the speech denoising 
task. 

Index Terms: Monaural Source Separation, Time-Frequency 
Masking, Deep Recurrent Neural Network, Discriminative Training 

I. Introduction 

Source separation is a problem in which several signals 
have been mixed together and the objective is to 
recover the original signals from the combined signals. Source 
separation is important for several real-world applications. 
For example, the accuracy of chord recognition and pitch 
estimation can be improved by separating the singing voice 
from the music accompaniment [?]. The accuracy of automatic 
speech recognition (ASR) can be improved by separating 
speech signals from noise [?]. Monaural source separation, i.e., 
source separation from monaural recordings, is particularly 
challenging because, without prior knowledge, there is an 
infinite number of solutions. In this paper, we focus on 
source separation from monaural recordings with applications 
to speech separation, singing voice separation, and speech 
denoising tasks. 

Several approaches have been proposed to address the 
monaural source separation problem. We categorize them into 
domain-specific and domain-agnostic approaches. For domain- 
specific approaches, models are designed according to the 
prior knowledge and assumptions of the tasks. For example, in 
singing voice separation tasks, several approaches have been 
proposed to exploit the assumption of the low rank and sparsity 
of the music and speech signals, respectively [?]. In 
speech denoising tasks, spectral subtraction subtracts a 
short-term noise spectrum estimate to generate the spectrum 
of a clean speech. By assuming the underlying properties of 
speech and noise, statistical model-based methods infer speech 
spectral coefficients given noisy observations [?]. However, in 
real-world scenarios, these strong assumptions may not always 
hold. For example, in the singing voice separation task, the 
drum sounds may lie in sparse subspaces instead of being low 
rank. In speech denoising tasks, the models often fail to predict 
the acoustic environments due to the non-stationary nature of 
noise. 

For domain-agnostic approaches, models are learned from 
data directly without having any prior assumption in the task 
domain. Non-negative matrix factorization (NMF) and 
probabilistic latent semantic indexing (PLSI) [?] learn 
the non-negative reconstruction bases and weights of different 
sources and use them to factorize time-frequency spectral 
representations. NMF and PLSI can be viewed as a linear 
transformation of the given mixture features (e.g. magnitude 
spectra) during the prediction time. However, based on the 
minimum mean squared error (MMSE) estimate criterion, the 
optimal estimator E[Y|X] is a linear model in X only if X and 
Y are jointly Gaussian, where X and Y are the mixture and 
separated signals, respectively. In real-world scenarios, since 
signals might not always follow Gaussian distributions, linear 
models are not expressive enough to model the complicated 
relationship between separated and mixture signals. We consider 
the mapping relationship between the mixture signals 
and separated sources as a nonlinear transformation, and hence 
nonlinear models such as deep neural networks (DNNs) are 
desirable. 

In this paper, we propose a general monaural source separation 
framework to jointly model all sources within a mixture 
as targets to a deep recurrent neural network (DRNN). We propose 
to utilize the constraints between the original mixture and 
the output predictions through time-frequency mask functions 
and jointly optimize the time-frequency functions along with 
the deep recurrent neural network. Given a mixture signal, 
the proposed approach directly reconstructs the predictions 
of target sources in an end-to-end fashion. In addition, given 
that there are predicted results of competing sources in the 
output layer, we further propose a discriminative training 
criterion for enhancing the source to interference ratio. We 
extend our previous work in [11] and [?] and propose a 
general framework for monaural source separation tasks with 
applications to speech separation, singing voice separation, and 
speech denoising. We further extend our speech separation 
experiments in [?] to a larger speech corpus, the TSP 
dataset [13], with different model architectures and different 
speaker genders, and we extend our proposed framework to 
speech denoising tasks under various matched and mismatched 
conditions. 

The organization of this paper is as follows: Section II 
reviews and compares recent monaural source separation work 
based on deep learning models. Section III introduces the 
proposed methods, including the deep recurrent neural networks, 
joint optimization of deep learning models and soft 
time-frequency masking functions, and the training objectives. 
Section IV presents the experimental setting and results using 
the TSP [13], MIR-1K [?], and TIMIT [?] datasets for 
speech separation, singing voice separation, and speech denoising 
tasks, respectively. We conclude the paper in Section V. 


II. Related Work 

Recently, deep learning based methods have started to 
attract much attention in the source separation research community 
by modeling the nonlinear mapping relationship between 
mixture and separated signals. Prior work on deep 
learning based source separation can be grouped into three 
categories, depending on the interaction between input mixture 
and output targets. 

Denoising-based approaches: These methods utilize deep 
learning based models to learn the mapping from the mixture 
signals to one of the sources among the mixture signals. In the 
speech recognition task, given noisy features, Maas et al. [?] 
proposed to apply a DRNN to predict clean speech features. 
In the speech enhancement task, Xu et al. and Liu et 
al. [?] proposed to use a DNN for predicting clean speech 
signals given noisy speech signals. The denoising methods do 
not consider the relationships between target and other sources 
in the mixture, which is suboptimal in the source separation 
framework where all the sources are important. In contrast, 
our proposed model considers all sources in the mixture and 
utilizes the relationship among the sources to formulate time- 
frequency masks. 

Time-frequency mask based approaches: A time-frequency 
mask considers the relationships among the 
sources in a mixture signal, enforces the constraints between 
an input mixture and the output predictions, and hence results 
in smooth prediction results. Weninger et al. [19] trained 
two long short-term memory (LSTM) RNNs for predicting 
speech and noise, respectively. A final prediction is made by 
applying a time-frequency mask based on the speech and noise 
predictions. Instead of training a model for each source and 
applying the time-frequency mask separately, our proposed 
model jointly optimizes time-frequency masks with a network 
which models all the sources directly. 

Another type of approach is to apply deep learning models 
to predict a time-frequency mask for one of the sources. After 
the time-frequency mask is learned, the estimated source is obtained 
by multiplying the learned time-frequency mask with an 
input mixture. Nie et al. [20] utilized deep stacking networks 
with time series inputs and a re-threshold method to predict an 
ideal binary mask. Narayanan and Wang [21] and Wang and 
Wang [22] proposed a two-stage framework (DNNs with a 
one-layer perceptron and DNNs with an SVM) for predicting 
a time-frequency mask. Wang et al. [23] recently proposed 
to train deep neural networks for different targets, including 
ideal ratio mask, FFT-mask, and Gammatone frequency power 
spectrum for speech separation tasks. Our proposed approach 
learns time-frequency masks for all the sources internally 
with the DRNNs and directly optimizes separated results with 
respect to ground truth signals in an end-to-end fashion. 

Multiple-target based approaches: These methods model 
all output sources in a mixture as deep learning model training 
targets. Tu et al. [24] proposed modeling clean speech and 
noise as the output targets for a robust ASR task. However, the 
authors do not consider the constraint that the sum of all the 
sources is the original mixture. Grais et al. proposed using 
a deep neural network to predict two scores corresponding to 
the probabilities of two different sources respectively given 
a frame of normalized magnitude spectrum. Our proposed 
method also models all sources as training targets. We further 
enforce the constraints between an input mixture and the 
output predictions through time-frequency masks which are 
learned along with DRNNs. 

III. Proposed Methods 
A. Deep Recurrent Neural Networks 

Given that audio signals are time series in nature, we 
propose to model the temporal information using deep recurrent 
neural networks for monaural source separation tasks. 
To capture the contextual information among audio signals, 
one way is to concatenate neighboring audio features, e.g., 
magnitude spectra, together as input features to a deep neural 
network. However, the number of neural network parameters 
increases proportionally to the input dimension and the number 
of neighbors in time. Hence, the size of the concatenating 
window is limited. Another approach is to utilize recurrent 
neural networks (RNNs) for modeling the temporal information. 
An RNN can be considered as a DNN with indefinitely 
many layers, which introduce the memory from previous time 
steps, as shown in Figure 1(a). The potential weakness of 
RNNs is that they lack hierarchical processing of the input 
at the current time step. 

[Figure 1: panels (a) 1-layer RNN, (b) L-layer DRNN-l, (c) L-layer stacked RNN (sRNN)] 

Fig. 1. Deep Recurrent Neural Network (DRNN) architectures: Arrows represent connection matrices. Black, white, and gray circles represent input frames, 
hidden states, and output frames, respectively. The architecture in (a) is a standard recurrent neural network, (b) is an L hidden layer DRNN with recurrent 
connection at the l-th layer (denoted by DRNN-l), and (c) is an L hidden layer DRNN with recurrent connections at all levels (denoted by stacked RNN). 

To further provide the hierarchical 
information through multiple time scales, deep recurrent neural 
networks (DRNNs) are explored [26], [27]. We formulate 
DRNNs in two schemes as shown in Figure 1(b) and Figure 
1(c). Figure 1(b) is an L hidden layer DRNN with 
temporal connection at the l-th layer. Figure 1(c) is an 
L hidden layer DRNN with full temporal connections (called 
stacked RNN (sRNN) in [27]). Formally, we define the two 
DRNN schemes as follows. Suppose there is an L hidden layer 
DRNN with the recurrent connection at the l-th layer, the l-th 
hidden activation at time t, $h_t^l$, is defined as: 

$$
\begin{aligned}
h_t^l &= f_h(x_t, h_{t-1}^l) \\
      &= \phi_l\left(U^l h_{t-1}^l + W^l \phi_{l-1}\left(W^{l-1}\left(\ldots \phi_1\left(W^1 x_t\right)\right)\right)\right)
\end{aligned}
\tag{1}
$$

and the output $y_t$ is defined as: 

$$
\begin{aligned}
y_t &= f_o(h_t^l) \\
    &= W^L \phi_{L-1}\left(W^{L-1}\left(\ldots \phi_l\left(W^l h_t^l\right)\right)\right)
\end{aligned}
\tag{2}
$$

where $f_h$ and $f_o$ are a state transition function and an output 
function, respectively, $x_t$ is the input to the network at time t, 
$\phi_l(\cdot)$ is an element-wise nonlinear function at the l-th layer, 
$W^l$ is the weight matrix for the l-th layer, and $U^l$ is the 
weight matrix for the recurrent connection at the l-th layer. 
The recurrent weight matrix $U^k$ is a zero matrix for the rest 
of the layers where $k \neq l$. The output layer is a linear layer. 
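
The forward computation of a DRNN with a single recurrent layer can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `drnn_l_forward`, the zero initial state, the ReLU nonlinearity at every hidden layer, and the separate linear output matrix `W_out` are choices made here, and the layers above the recurrent one are treated as plain feedforward layers.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def drnn_l_forward(x_seq, W, U_l, W_out, l):
    """Sketch of a DRNN with one recurrent connection at hidden layer l (1-based).

    x_seq : (T, d_in) input frames
    W     : list of L hidden-layer weight matrices; W[k-1] feeds hidden layer k
    U_l   : (m, m) recurrent weight matrix of layer l (all other layers have none)
    W_out : linear output layer applied to the top hidden layer (assumption)
    """
    h_prev = np.zeros(U_l.shape[0])   # assumed zero initial state
    outputs = []
    for x_t in x_seq:
        h = x_t
        for k in range(1, len(W) + 1):
            a = W[k - 1] @ h
            if k == l:
                a = a + U_l @ h_prev  # recurrent term, only at layer l
            h = relu(a)
            if k == l:
                h_prev = h            # remember layer-l state for the next frame
        outputs.append(W_out @ h)
    return np.stack(outputs)
```

Setting `U_l` to a zero matrix makes every frame independent of its history, which matches the remark that a DNN is a DRNN whose temporal weight matrix is zero.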

The stacked RNNs, as shown in Figure 1(c), have multiple 
levels of transition functions, defined as: 

$$
\begin{aligned}
h_t^l &= f_h(h_t^{l-1}, h_{t-1}^l) \\
      &= \phi_l\left(U^l h_{t-1}^l + W^l h_t^{l-1}\right)
\end{aligned}
\tag{3}
$$

where $h_t^l$ is the hidden state of the l-th layer at time t, $\phi_l(\cdot)$ is 
an element-wise nonlinear function at the l-th layer, $W^l$ is the 
weight matrix for the l-th layer, and $U^l$ is the weight matrix 
for the recurrent connection at the l-th layer. When the layer 
$l = 1$, the hidden activation is computed using $h_t^0 = x_t$. 
For the nonlinear function $\phi_l(\cdot)$, similar to [28], we empirically 
found that using the rectified linear unit $\phi_l(x) = \max(0, x)$ 
performs better compared to using a sigmoid or tanh function 
in our experiments. Note that a DNN can be regarded as a 
DRNN with the temporal weight matrix as a zero matrix. 
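
As an illustration of the stacked RNN transition, the following NumPy sketch computes the hidden states layer by layer with the ReLU nonlinearity. The function name `srnn_forward`, the zero initial states, and the list-based parameter layout are assumptions made for this example.

```python
import numpy as np

def srnn_forward(x_seq, W, U):
    """Sketch of a stacked RNN: every hidden layer has its own recurrence.

    x_seq : (T, d_in) input frames
    W     : list of L matrices, W[l-1] plays the role of W^l
    U     : list of L matrices, U[l-1] plays the role of U^l
    Returns the top-layer hidden states, shape (T, m).
    """
    L = len(W)
    h_prev = [np.zeros(U[k].shape[0]) for k in range(L)]  # assumed zero initial states
    top = []
    for x_t in x_seq:
        below = x_t                      # h_t^0 = x_t
        h_new = []
        for k in range(L):
            # phi_l(U^l h_{t-1}^l + W^l h_t^{l-1}) with phi_l = ReLU
            h = np.maximum(0.0, U[k] @ h_prev[k] + W[k] @ below)
            h_new.append(h)
            below = h
        h_prev = h_new
        top.append(below)
    return np.stack(top)
```

With all recurrent matrices set to zero the function reduces to an ordinary feedforward ReLU network applied frame by frame.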

For the computation complexity, given the same input 
features, during the forward-propagation stage, a DRNN with 
L hidden layers, m hidden units, and a temporal connection 
at the l-th layer requires an extra $O(m^2)$ IEEE floating point 
storage buffer to store the temporal weight matrix $U^l$ and 
extra $O(m^2)$ multiply-add operations to compute the hidden 
activations in Eq. (1) at the l-th layer, compared to a DNN 
with L hidden layers and m hidden units. During the back-propagation 
stage, DRNN uses back-propagation through time 
(BPTT) [29], [30] to update network parameters. Given an 
input sequence with T time steps in length, the DRNN with 
an l-th layer temporal connection requires an extra $O(Tm)$ 
space to keep hidden activations in memory and requires 
$O(Tm^2)$ operations ($O(m^2)$ operations per time step) for 
updating parameters, compared to a DNN [?]. Indeed, the 
only pragmatically significant computational cost of a DRNN 
with respect to a DNN is that the recurrent layer limits the 
granularity with which back-propagation can be parallelized. 
As gradient updates based on sequential steps cannot be 
computed in parallel, for improving the efficiency of DRNN 
training, utterances are chopped into sequences of at most 100 
time steps. 
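
The chopping step can be illustrated with a small helper. This is a minimal sketch: the name `chop_sequence` and the choice to keep a shorter final chunk (rather than pad or drop it) are assumptions, since the text only states the limit of 100 time steps.

```python
import numpy as np

def chop_sequence(frames, max_len=100):
    """Split a (T, F) feature array into consecutive chunks of at most max_len frames.

    The last chunk may be shorter than max_len (assumption: it is kept as is).
    """
    return [frames[start:start + max_len] for start in range(0, len(frames), max_len)]

# Example: an utterance of 250 frames yields chunks of 100, 100, and 50 frames.
utterance = np.zeros((250, 513))
chunk_lengths = [len(c) for c in chop_sequence(utterance)]
```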

B. Model Architecture 

We consider the setting where there are two sources additively 
mixed together, though our proposed framework can be 
generalized to more than two sources. At time t, the training 
input $x_t$ of the network is the concatenation of features, e.g., 
logmel features or magnitude spectra, from a mixture within 
a window. The output targets $y_{1t} \in \mathbb{R}^F$ and $y_{2t} \in \mathbb{R}^F$ and 
the output predictions $\hat{y}_{1t} \in \mathbb{R}^F$ and $\hat{y}_{2t} \in \mathbb{R}^F$ of the deep 
learning models are the magnitude spectra of different sources, 
where F is the magnitude spectral dimension. 
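
Concatenating features within a window can be sketched as follows. The helper name `stack_context` and the edge-padding rule for the first and last frames are assumptions introduced for this example; the text does not say how utterance boundaries are handled.

```python
import numpy as np

def stack_context(frames, context=1):
    """Concatenate each frame with `context` neighbours on each side.

    frames : (T, F) array of per-frame features
    returns: (T, (2 * context + 1) * F) array; boundary frames are padded by
             repeating the first/last frame (an assumption of this sketch).
    """
    T = len(frames)
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.concatenate([padded[i:i + T] for i in range(2 * context + 1)], axis=1)
```

With `context=0` the input is returned unchanged, which corresponds to a window containing a single frame.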

Since our goal is to separate different sources from a 
mixture, instead of learning one of the sources as the target, 
we propose to simultaneously model all the sources. Figure 2 
shows an example of the architecture, which can be viewed 
as the t-th column in Figure 1. 

[Figure 2: network diagram with two outputs, Source 1 and Source 2] 

Fig. 2. Proposed neural network architecture, which can be viewed as the t-th 
column in Figure 1. We propose to jointly optimize time-frequency masking 
functions as a layer with a deep recurrent neural network. 


Moreover, we find it useful to further smooth the source 
separation results with a time-frequency masking technique, 
for example, binary time-frequency masking or soft time-frequency 
masking [?]. The time-frequency 
masking function enforces the constraint that the sum of the 
prediction results is equal to the original mixture. Given the 
input features $x_t$ from the mixture, we obtain the output 
predictions $\hat{y}_{1t}$ and $\hat{y}_{2t}$ through the network. The soft time-frequency 
mask $m_t \in \mathbb{R}^F$ is defined as follows: 

$$
m_t = \frac{|\hat{y}_{1t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|}
\tag{4}
$$

where the addition and division operators are element-wise 
operations. 

Similar to [19], a standard approach is to apply the time-frequency 
masks $m_t$ and $1 - m_t$ to the magnitude spectra 
$z_t \in \mathbb{R}^F$ of the mixture signals, and obtain the estimated 
separation spectra $\hat{s}_{1t} \in \mathbb{R}^F$ and $\hat{s}_{2t} \in \mathbb{R}^F$, which correspond 
to sources 1 and 2, as follows: 

$$
\begin{aligned}
\hat{s}_{1t} &= m_t \odot z_t \\
\hat{s}_{2t} &= (1 - m_t) \odot z_t
\end{aligned}
\tag{5}
$$

where the subtraction and $\odot$ (Hadamard product) operators 
are element-wise operations. 
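
A small NumPy sketch of the soft mask and its application to the mixture magnitude is given here for concreteness. The function name `soft_mask_separate` and the small constant `eps`, which guards against division by zero when both predictions vanish, are assumptions added for this example.

```python
import numpy as np

def soft_mask_separate(y1_hat, y2_hat, z, eps=1e-12):
    """Apply a soft time-frequency mask built from two predicted magnitude spectra.

    y1_hat, y2_hat : network predictions for sources 1 and 2
    z              : magnitude spectrum of the mixture (same shape)
    eps            : numerical guard, not part of the original definition
    """
    m = np.abs(y1_hat) / (np.abs(y1_hat) + np.abs(y2_hat) + eps)
    s1_hat = m * z            # element-wise (Hadamard) product
    s2_hat = (1.0 - m) * z
    return s1_hat, s2_hat

s1, s2 = soft_mask_separate(np.array([3.0, 1.0]), np.array([1.0, 1.0]), np.array([8.0, 4.0]))
# The two estimates add up to the mixture magnitude by construction.
```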

Given the benefit of smoothing separation and enforcing the 
constraints between an input mixture and the output predictions 
using time-frequency masks, we propose to incorporate 
the time-frequency masking functions as a layer in the neural 
network. Instead of training the neural network and applying 
the time-frequency masks to the predictions separately, we 
propose to jointly train the deep learning model with the time-frequency 
masking functions. We add an extra layer to the 
original output of the neural network as follows: 

$$
\begin{aligned}
\tilde{y}_{1t} &= \frac{|\hat{y}_{1t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|} \odot z_t \\
\tilde{y}_{2t} &= \frac{|\hat{y}_{2t}|}{|\hat{y}_{1t}| + |\hat{y}_{2t}|} \odot z_t
\end{aligned}
\tag{6}
$$

where the addition, division, and $\odot$ (Hadamard product) operators 
are element-wise operations. The architecture is shown in 
Figure 2. In this way, we can integrate the constraints into the 
network and optimize the network with the masking functions 
jointly. Note that although this extra layer is a deterministic 
layer, the network weights are optimized for the error metric 
between $\tilde{y}_{1t}$, $\tilde{y}_{2t}$ and $y_{1t}$, $y_{2t}$, using the back-propagation 
algorithm. The time domain signals are reconstructed based 
on the inverse short-time Fourier transform (ISTFT) of the 
estimated magnitude spectra along with the original mixture 
phase spectra. 
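
The last step, pairing an estimated magnitude spectrogram with the phase of the mixture before inversion, can be sketched as below. The helper name `attach_mixture_phase` is hypothetical, and the ISTFT itself is left to whichever STFT implementation is in use.

```python
import numpy as np

def attach_mixture_phase(est_magnitude, mixture_stft):
    """Build a complex spectrogram from an estimated magnitude and the mixture phase.

    est_magnitude : non-negative array, e.g. the output of the masking layer
    mixture_stft  : complex STFT of the mixture, same shape
    The result is what would be passed to an inverse STFT routine.
    """
    phase = np.angle(mixture_stft)
    return est_magnitude * np.exp(1j * phase)
```

Because both masked estimates reuse the same phase and their magnitudes sum to the mixture magnitude, the two complex spectrograms also sum to the mixture STFT.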


C. Training Objectives 

Given the output predictions $\hat{y}_{1t}$ and $\hat{y}_{2t}$ (or $\tilde{y}_{1t}$ and $\tilde{y}_{2t}$) 
of the original sources $y_{1t}$ and $y_{2t}$, $t = 1, \ldots, T$, where T 
is the length of an input sequence, we optimize the neural 
network parameters by minimizing the squared error: 

$$
J_{MSE} = \frac{1}{2} \sum_{t=1}^{T} \left( \|\hat{y}_{1t} - y_{1t}\|_2^2 + \|\hat{y}_{2t} - y_{2t}\|_2^2 \right)
\tag{7}
$$

In Eq. (7), we measure the difference between the predicted 
and the actual targets. When targets have similar spectra, it 
is possible for the DNN to minimize Eq. (7) by being too 
conservative: when a feature could be attributed to either 
source 1 or source 2, the neural network attributes it to both. 
The conservative strategy is effective in training, but leads 
to reduced signal-to-interference ratio (SIR) in testing, as the 
network allows ambiguous spectral features to bleed through 
partially from one source to the other. We address this issue 
by proposing a discriminative network training criterion for 
reducing the interference, possibly at the cost of increased 
artifacts. Suppose that we define 

$$
J_{DIS} = -(1 - \gamma) \ln p_{12}(y) - \gamma D_{KL}(p_{12} \| p_{21})
\tag{8}
$$

where $0 < \gamma < 1$ is a regularization constant. $p_{12}(y)$ is the 
likelihood of the training data under the assumption that the 
neural net computes the MSE estimate of each feature vector 
(i.e., its conditional expected value given knowledge of the 
mixture), and that all residual noise is Gaussian with unit 
covariance, thus 

$$
\ln p_{12}(y) = -\frac{1}{2} \sum_{t=1}^{T} \left( \|\hat{y}_{1t} - y_{1t}\|_2^2 + \|\hat{y}_{2t} - y_{2t}\|_2^2 \right)
\tag{9}
$$

The discriminative term, $D_{KL}(p_{12} \| p_{21})$, is a point estimate of 
the KL divergence between the likelihood model $p_{12}(y)$ and 
the model $p_{21}(y)$, where the latter is computed by swapping 
affiliation of spectra to sources, thus 

$$
\begin{aligned}
D_{KL}(p_{12} \| p_{21}) = \frac{1}{2} \sum_{t=1}^{T} \big( & \|\hat{y}_{1t} - y_{2t}\|_2^2 + \|\hat{y}_{2t} - y_{1t}\|_2^2 \\
& - \|\hat{y}_{1t} - y_{1t}\|_2^2 - \|\hat{y}_{2t} - y_{2t}\|_2^2 \big)
\end{aligned}
\tag{10}
$$

[Figure 3: spectrogram panels (a) Mixture, (b) Original female voice, (c) Recovered female voice, (d) Original male voice, (e) Recovered male voice; horizontal axis: Time (s)] 

Fig. 3. A speech separation example using the TSP dataset. (a) The mixture (female (FA) and male (MC) speech) magnitude spectrogram for a test clip in 
TSP; (b) the ground truth spectrogram of the female speech; (c) the separated female speech spectrogram from our proposed model (DRNN-1 + discrim); (d) 
the ground truth spectrogram of the male speech; (e) the separated male speech spectrogram from our proposed model (DRNN-1 + discrim). 

[Figure 4: spectrogram panels (a) Mixture, (b) Original singing, (c) Recovered singing, (d) Original music, (e) Recovered music] 

Fig. 4. A singing voice separation example using the MIR-1K dataset. (a) The mixture (singing voice and music accompaniment) magnitude spectrogram 
for the clip Yifen_2_07 in MIR-1K; (b) the ground truth spectrogram for the singing voice; (c) the separated singing voice spectrogram from our proposed 
model (DRNN-2 + discrim); (d) the ground truth spectrogram for the music accompaniment; (e) the separated music accompaniment spectrogram from our 
proposed model (DRNN-2 + discrim). 

[Figure 5: spectrogram panels (a) Mixture, (b) Original speech, (c) Recovered speech, (d) Original noise, (e) Recovered noise; horizontal axis: Time (s)] 

Fig. 5. A speech denoising example using the TIMIT dataset. (a) The mixture (speech and babble noise) magnitude spectrogram for a test clip in TIMIT; 
(b) the ground truth spectrogram for the speech; (c) the separated speech spectrogram from our proposed model (DNN); (d) the ground truth spectrogram for 
the babble noise; (e) the separated babble noise spectrogram from our proposed model (DNN). 



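Combining Eqs. (8)-(10) gives a discriminative criterion 
with a simple and useful form: 

$$
\begin{aligned}
J_{DIS} = \frac{1}{2} \sum_{t=1}^{T} \big( & \|\hat{y}_{1t} - y_{1t}\|_2^2 + \|\hat{y}_{2t} - y_{2t}\|_2^2 \\
& - \gamma \|\hat{y}_{1t} - y_{2t}\|_2^2 - \gamma \|\hat{y}_{2t} - y_{1t}\|_2^2 \big)
\end{aligned}
\tag{11}
$$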
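
For readers who want the intermediate algebra, the combination can be written out with two shorthand symbols, $A$ and $B$, which are introduced here only for brevity and are not part of the original notation. $A$ collects the matched-source errors and $B$ the swapped-source errors, so that the log-likelihood is $-A/2$ and the KL term is $(B - A)/2$:

```latex
\begin{aligned}
A &= \sum_{t=1}^{T} \left( \|\hat{y}_{1t} - y_{1t}\|_2^2 + \|\hat{y}_{2t} - y_{2t}\|_2^2 \right), \qquad
B = \sum_{t=1}^{T} \left( \|\hat{y}_{1t} - y_{2t}\|_2^2 + \|\hat{y}_{2t} - y_{1t}\|_2^2 \right), \\
J_{DIS} &= -(1 - \gamma)\left(-\tfrac{1}{2} A\right) - \gamma \cdot \tfrac{1}{2}\,(B - A) \\
        &= \tfrac{1}{2}\left( (1 - \gamma) A + \gamma A - \gamma B \right) \\
        &= \tfrac{1}{2}\left( A - \gamma B \right).
\end{aligned}
```

The $\gamma A$ contributions cancel the $(1 - \gamma)$ weighting, which is why the matched-source terms appear with unit weight in the final criterion.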

Although Eq. (7) directly optimizes the reconstruction objective, 
adding the extra term $-\gamma \|\hat{y}_{1t} - y_{2t}\|_2^2 - \gamma \|\hat{y}_{2t} - y_{1t}\|_2^2$ 
in Eq. (11) further penalizes the interference from the other 
source, and can be viewed as a regularizer of Eq. (7) during the 
training. From our experimental results, we generally achieve 
higher source to interference ratio while maintaining similar 
or higher source to distortion ratio and source to artifacts ratio. 
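
The discriminative criterion is straightforward to evaluate numerically. The sketch below is illustrative only: the function name `discriminative_loss` and the default `gamma=0.05` (picked from inside the 0.01-0.1 range reported in the experiments) are assumptions, and no gradient computation is shown.

```python
import numpy as np

def discriminative_loss(y1_hat, y2_hat, y1, y2, gamma=0.05):
    """Squared-error objective with a penalty on similarity to the other source.

    All arrays have shape (T, F). With gamma = 0 this is the plain squared error.
    """
    same = np.sum((y1_hat - y1) ** 2) + np.sum((y2_hat - y2) ** 2)
    cross = np.sum((y1_hat - y2) ** 2) + np.sum((y2_hat - y1) ** 2)
    return 0.5 * (same - gamma * cross)

# Toy example with one frame and two frequency bins.
y1 = np.array([[1.0, 0.0]])
y2 = np.array([[0.0, 1.0]])
hedged = np.array([[0.5, 0.5]])          # "conservative" prediction for both sources
loss_perfect = discriminative_loss(y1, y2, y1, y2, gamma=0.1)
loss_hedged = discriminative_loss(hedged, hedged, y1, y2, gamma=0.1)
```

In this toy case the hedged prediction, which attributes every bin to both sources, receives a higher loss than the perfect prediction.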

IV. Experiments 

In this section, we evaluate the proposed models on three 
monaural source separation tasks: speech separation, singing 
voice separation, and speech denoising. We quantitatively evaluate 
the source separation performance using three metrics: 
Source to Interference Ratio (SIR), Source to Artifacts Ratio 
(SAR), and Source to Distortion Ratio (SDR), according to the 
BSS-EVAL metrics [?]. SDR is the ratio of the power of the 
input signal to the power of the difference between input and 
reconstructed signals. SDR is therefore exactly the same as 
the classical measure “signal-to-noise ratio” (SNR), and SDR 
reflects the overall separation performance. In addition to SDR, 
SIR reports errors caused by failures to fully remove the interfering 
signal, and SAR reports errors caused by extraneous 
artifacts introduced during the source separation procedure. In 
the past decade, the source separation community has been 
seeking more precise information about source reconstruction 
performance; in particular, recent papers [?] and 
competitions (e.g., Signal Separation Evaluation Campaign 
(SiSEC), Music Information Retrieval Evaluation eXchange (MIREX)) 
now separately report SDR, SIR, and SAR for objectively 
comparing different approaches. Note that these measures are 
defined so that distortion = interference + artifacts. For the 
speech denoising task, we additionally compute the short-time 
objective intelligibility measure (STOI) which is a quantitative 
estimate of the intelligibility of the denoised speech [35]. 
Higher values of SDR, SAR, SIR, and STOI represent higher 
separation quality. 
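
The verbal definition of SDR as a power ratio can be made concrete with a few lines of NumPy. This is a simplified sketch of that sentence only: the function name `simple_sdr_db` is hypothetical, and the full BSS-EVAL procedure, which decomposes the error into interference and artifact components, is not reproduced here.

```python
import numpy as np

def simple_sdr_db(reference, estimate):
    """Ratio (in dB) of reference power to the power of the reconstruction error.

    Simplified illustration; it does not perform the BSS-EVAL decomposition
    needed to report SIR and SAR separately.
    """
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

# An estimate that is off by 10% in amplitude has an error power 100 times
# smaller than the reference power, i.e. about 20 dB.
ref = np.ones(4)
sdr = simple_sdr_db(ref, 0.9 * ref)
```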

We use the abbreviations DRNN-k and sRNN to denote the 
DRNN with the recurrent connection at the k-th hidden layer, 
or at all hidden layers, respectively. Examples are shown in 
Figure 1. We select the architecture and hyperparameters (the $\gamma$ 
parameter in Eq. (11), the mini-batch size, L-BFGS iterations, 
and the circular shift size of the training data) based on the 
development set performance. 

[Figure 6: SDR, SIR, and SAR panels for Female (FA) vs. Male (MC), with Spectral Features (top row) and Logmel Features (bottom row). Columns: 1. NMF, 2. DNN+w/o joint, 3. DNN, 4. DNN+discrim, 5. DRNN-1, 6. DRNN-1+discrim, 7. DRNN-2, 8. DRNN-2+discrim, 9. sRNN, 10. sRNN+discrim] 

Fig. 6. TSP speech separation results (Female vs. Male), where “w/o joint” indicates the network is not trained with the masking layer, and “discrim” indicates 
the training with the discriminative objective. Note that the NMF model uses spectral features. 

[Figure 7: panels for Female (FA) vs. Female (FB), Spectral Features. Columns: 1. NMF, 2. DNN+w/o joint, 3. DNN, 4. DNN+discrim, 5. DRNN-1, 6. DRNN-1+discrim, 7. DRNN-2, 8. DRNN-2+discrim, 9. sRNN, 10. sRNN+discrim] 

Fig. 7. TSP speech separation results (Female vs. Female), where “w/o joint” indicates the network is not trained with the masking layer, and “discrim” 
indicates the training with the discriminative objective. Note that the NMF model uses spectral features. 

We optimize our models by back-propagating the gradients 
with respect to the training objective in Eq. (11). We 
use the limited-memory Broyden-Fletcher-Goldfarb-Shanno 
(L-BFGS) algorithm to train the models from random 
initialization. Examples of the separation results are shown in 
Figures 3, 4, and 5. The sound examples and source code of 
this work are available online 
(https://sites.google.com/site/deeplearningsourceseparation/). 

A. Speech Separation Setting 

We evaluate the performance of the proposed approaches for 
a monaural speech separation task using the TSP corpus [13]. 
There are 1444 utterances, with average length 2.372 s, spoken 
by 24 speakers (half male and half female). We choose four 
speakers, FA (female), FB (female), MC (male), and MD (male), 
from the TSP speech database. After concatenating together 
60 sentences for each speaker, we use 80% of the signals 
for training, 10% for development, and 10% for testing. The 
signals are downsampled to 16 kHz. The neural networks are 
trained on three different mixing cases: FA versus MC, FA 
versus FB, and MC versus MD. Since FA and FB are female 
speakers while MC and MD are male, the latter two cases 
are expected to be more difficult due to the similar frequency 
ranges from the same gender. After normalizing the signals 
to have 0 dB input SNR, the neural networks are trained to 
learn the mapping between an input mixture spectrum and the 
corresponding pair of clean spectra. 
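
Mixing two signals at 0 dB can be sketched as follows. The helper name `mix_at_0db`, the choice to rescale the second source rather than the first, and the truncation to the shorter length are assumptions for this example.

```python
import numpy as np

def mix_at_0db(s1, s2):
    """Scale s2 to the energy of s1 and add the two signals (0 dB mixture).

    Both inputs are 1-D time-domain arrays; they are truncated to the shorter
    length before mixing. Returns (mixture, s1, scaled_s2).
    """
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    scale = np.sqrt(np.sum(s1 ** 2) / np.sum(s2 ** 2))
    s2_scaled = scale * s2
    return s1 + s2_scaled, s1, s2_scaled
```

The returned pair of sources is what the network would be trained to recover from the mixture.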

[Figure 8: panels for Male (MC) vs. Male (MD). Columns: 1. NMF, 2. DNN+w/o joint, 3. DNN, 4. DNN+discrim, 5. DRNN-1, 6. DRNN-1+discrim, 7. DRNN-2, 8. DRNN-2+discrim, 9. sRNN, 10. sRNN+discrim] 

Fig. 8. TSP speech separation results (Male vs. Male), where “w/o joint” indicates the network is not trained with the masking layer, and “discrim” indicates 
the training with the discriminative objective. Note that the NMF model uses spectral features. 

As for the NMF experiments, 10 to 100 speaker-specific 
basis vectors are trained from the training part of the signals. 
The optimal number of basis vectors is chosen based on the 
development set. We empirically found that using 20 basis 
vectors achieves the best performance on the development 
set in the three different mixing cases. The NMF separation 
is done by fixing the known speakers’ basis vectors during 
the test procedure and learning the speaker-specific activation 
matrices. 
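
The test-time NMF step, learning activations while the speaker bases stay fixed, can be sketched with the standard multiplicative update for the generalized KL divergence. Everything named here (`kl_divergence`, `nmf_activations`, `nmf_separate`, the random initialization, the `eps` guard, and returning the per-speaker products as the separated spectra) is an assumption of this sketch rather than a description of the baseline's actual code.

```python
import numpy as np

def kl_divergence(V, V_hat, eps=1e-12):
    """Generalized KL divergence between non-negative matrices (eps avoids log(0))."""
    return np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat)

def nmf_activations(V, W, n_iter=100, eps=1e-12, seed=0):
    """Estimate H >= 0 such that V is approximately W @ H, keeping W fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Multiplicative update for the generalized KL objective, bases held fixed.
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
    return H

def nmf_separate(V, W1, W2, n_iter=100):
    """Split mixture magnitudes V using fixed per-speaker bases W1 and W2."""
    W = np.hstack([W1, W2])
    H = nmf_activations(V, W, n_iter=n_iter)
    k = W1.shape[1]
    return W1 @ H[:k], W2 @ H[k:]
```

The two returned matrices can be used directly as source estimates or turned into a soft mask over the mixture.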

We explore two different types of input features: spectral 
and log-mel filterbank features. The spectral representation 
is extracted using a 1024-point short-time Fourier transform 
(STFT) with 50% overlap. In the speech recognition literature 
[37], the log-mel filterbank is found to provide lower word-error-rate 
compared to mel-frequency cepstral coefficients 
(MFCC) and log FFT bins. The 40-dimensional log-mel representation 
and the first- and second-order derivative features 
are used in the experiments. For the neural network training, in 
order to increase the variety of training samples, we circularly 
shift (in the time domain) the signals of one speaker and mix 
them with utterances from the other speaker. 
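
The circular-shift augmentation can be illustrated with `np.roll`. The generator name `circular_shift_mixtures` and the rule of stepping through shifts in multiples of `shift_size` are assumptions; the text only states that one speaker's signal is circularly shifted in the time domain before mixing.

```python
import numpy as np

def circular_shift_mixtures(s1, s2, shift_size):
    """Yield (mixture, s1, shifted_s2) for circular shifts 0, shift_size, 2*shift_size, ...

    s1, s2     : 1-D time-domain signals (truncated to the shorter length)
    shift_size : step between successive shifts, in samples
    """
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    for k in range(0, n, shift_size):
        shifted = np.roll(s2, k)      # samples that fall off the end wrap around
        yield s1 + shifted, s1, shifted

examples = list(circular_shift_mixtures(np.arange(6.0), np.array([1.0, 0, 0, 0, 0, 0]), 2))
```

A smaller `shift_size` produces more training mixtures from the same pair of recordings.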

B. Speech Separation Results 

We use the standard NMF with the generalized KL- 
divergence metric as our baseline. We report the best NMF 
results among models with different basis vectors, as shown 
in the first column of Figures 6, 7, and 8. Note that NMF 
uses spectral features, and hence the results in the second row 
(log-mel features) of each figure are the same as the first row 
(spectral features). 

The speech separation results of the cases, FA versus MC, 
FA versus FB, and MC versus MD, are shown in Figures 6, 7, 
and 8, respectively. We train models with two hidden layers 
of 300 hidden units using features with a context window 
size of one frame (one frame within a window), where the 
architecture and the hyperparameters are chosen based on the 
development set performance. We report the results of single 
frame spectra and log-mel features in the top and bottom rows 
of Figures 6, 7, and 8, respectively. To further understand the 
strength of the models, we compare the experimental results 
in several aspects. In the second and third columns of Figures 
6, 7, and 8, we examine the effect of joint optimization of the 
masking layer and the DNN. Jointly optimizing the masking 
layer significantly outperforms the cases where the masking 
layer is applied separately (the second column). In the FA vs. 
FB case, DNN without joint optimization of the masking layer 
achieves high SAR, but results in low SDR and SIR. In the top 
and bottom rows of Figures 6, 7, and 8, we compare the results 
between spectral features and log-mel features. In the joint 
optimization case (columns 3-10), log-mel features achieve 
higher SDRs, SIRs, and SARs compared to spectral features. 
On the other hand, spectral features achieve higher SDRs and 
SIRs in the case where DNN is not jointly trained with a 
masking layer, as shown in the second column of Figures 
6, 7, and 8. In the FA vs. FB and MC vs. MD cases, the log-mel 
features outperform spectral features greatly. 

Between columns 3, 5, 7, and 9, and columns 4, 6, 8, and 10 
of Figures 6, 7, and 8, we make comparisons between various 
network architectures, including DNN, DRNN-1, DRNN-2, 
and sRNN. In many cases, recurrent neural network models 
(DRNN-1, DRNN-2, or sRNN) outperform DNN. Between 
columns 3 and 4, columns 5 and 6, columns 7 and 8, and 
columns 9 and 10 of Figures 6, 7, and 8, we compare the 
effectiveness of using the discriminative training criterion, i.e., 
$\gamma > 0$ in Eq. (11). In most cases, SIRs are improved. The 
results match our expectation when we design the objective 
function. However, it also leads to some artifacts which 
result in slightly lower SARs in some cases. Empirically, the 
value $\gamma$ is in the range of 0.01-0.1 in order to achieve SIR 
improvements and maintain reasonable SAR and SDR. 

Finally, we compare the NMF results with our proposed 
models with the best architecture using spectral and log-mel 
features, as shown in Figure 9. NMF models learn activation 
matrices from different speakers and hence perform poorly in 
the same sex speech separation cases, FA vs. FB and MC vs. 
MD. Our proposed models greatly outperform NMF models 
for all three cases. Especially for the FA vs. FB case, our 
proposed model achieves around 5 dB SDR gain compared to 
the NMF model while maintaining higher SIR and SAR. 

[Figure 9: panel (a) Female (FA) vs. Male (MC). Columns: 1. NMF, 2. DRNN+discrim+spectra, 3. DRNN+discrim+logmel] 

Fig. 9. TSP speech separation result summary. We compare the results under 
three settings, (a) Female vs. Male, (b) Female vs. Female, and (c) Male vs. 
Male, using the NMF model, the best DRNN+discrim architecture with spectra 
features, and the best DRNN+discrim architecture with log-mel features. 

C. Singing Voice Separation Setting

We apply our models to a singing voice separation task, 
where one source is the singing voice and the other source is 
the background music. The goal is to separate singing voice 
from music recordings. 

We evaluate our proposed system using the MIR-1K dataset.¹
A thousand song clips are encoded at a sampling rate
of 16 kHz, with a duration from 4 to 13 seconds. The clips
were extracted from 110 Chinese karaoke songs performed by 
both male and female amateurs. There are manual annotations 
of the pitch contours, lyrics, indices and types for unvoiced 
frames, and the indices of the vocal and non-vocal frames; 
none of the annotations were used in our experiments. Each 
clip contains the singing voice and the background music in 
different channels. 

Following the evaluation framework used in previous work, we use 175
clips sung by one male and one female singer (“abjones” and 
“amy”) as the training and development set.² The remaining
825 clips of 17 singers are used for testing. For each clip, we 
mixed the singing voice and the background music with equal 
energy, i.e., 0 dB SNR. 

¹ https://sites.google.com/site/unvoicedsoundseparation/mir-1k
² Four clips, abjones_5_08, abjones_5_09, amy_9_08, amy_9_09, are used
as the development set for adjusting the hyperparameters.


TABLE I

MIR-1K SEPARATION RESULT COMPARISON USING DEEP NEURAL
NETWORKS WITH SINGLE SOURCE AS A TARGET AND USING TWO
SOURCES AS TARGETS (WITH AND WITHOUT JOINT OPTIMIZATION OF THE
MASKING LAYERS AND THE DNNS).

Model (num. of output sources, joint optimization)   GNSDR   GSIR    GSAR
DNN (1, no)                                           5.64    8.87    9.73
DNN (2, no)                                           6.44    9.08   11.26
DNN (2, yes)                                          6.93   10.99   10.15


To quantitatively evaluate the source separation results, we
report the overall performance via Global NSDR (GNSDR),
Global SIR (GSIR), and Global SAR (GSAR), which are
the means of the NSDRs, SIRs, and SARs, respectively,
over all test clips, weighted by clip length. Normalized SDR
(NSDR) is defined as:

$$\mathrm{NSDR}(\hat{v}, v, x) = \mathrm{SDR}(\hat{v}, v) - \mathrm{SDR}(x, v) \qquad (12)$$

where $\hat{v}$ is the estimated singing voice, $v$ is the original clean
singing voice, and $x$ is the mixture. NSDR estimates the
improvement of the SDR between the preprocessed mixture $x$
and the separated singing voice $\hat{v}$.
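
The NSDR definition and the length-weighted global averages can be sketched in a few lines of NumPy. This is a minimal illustration, not the evaluation code used for the reported numbers: `simple_sdr` is a hypothetical stand-in that measures the energy ratio between the reference and the estimation error, whereas the SDR, SIR, and SAR values in this paper come from a full blind source separation evaluation. The function names are assumptions introduced for this example.

```python
import numpy as np


def simple_sdr(estimate, reference):
    """Simplified SDR in dB: reference energy over estimation-error energy.

    This is an illustrative stand-in, not the full SDR decomposition.
    """
    error = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))


def nsdr(v_hat, v, x):
    """SDR improvement of the estimate v_hat over the unprocessed mixture x."""
    return simple_sdr(v_hat, v) - simple_sdr(x, v)


def global_mean(values, lengths):
    """Mean of per-clip scores weighted by clip length (GNSDR, GSIR, GSAR)."""
    return float(np.average(values, weights=lengths))
```

For instance, `global_mean([2.0, 4.0], [1, 3])` weights the second clip three times as heavily as the first, so longer clips contribute proportionally more to the global score.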

For the neural network training, in order to increase the 
variety of training samples, we circularly shift (in the time 
domain) the signals of the singing voice and mix them with 
the background music. In the experiments, we use magnitude 
spectra as input features to the neural network. The spectral 
representation is extracted using a 1024-point STFT with 50% 
overlap. Empirically, we found that using log-mel filterbank
features or the log power spectrum provides worse performance
than using magnitude spectra in the singing voice separation
task.
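
The circular-shift augmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the generator name `circular_shift_mixtures` and its arguments are hypothetical, both signals are assumed to be time-domain arrays of equal length, and `step` is the shift step size in samples.

```python
import numpy as np


def circular_shift_mixtures(voice, music, step):
    """Yield (mixture, shifted_voice, music) for each circular shift of the voice.

    voice, music: 1-D time-domain signals of equal length (assumed).
    step:         shift step size in samples (assumed parameter name).
    """
    if len(voice) != len(music):
        raise ValueError("voice and music must have the same length")
    for shift in range(0, len(voice), step):
        # np.roll wraps samples from the end of the signal back to the start.
        shifted = np.roll(voice, shift)
        yield shifted + music, shifted, music
```

Each shifted copy pairs the same singing voice with a different portion of the accompaniment, so a single clip yields roughly `len(voice) / step` distinct training mixtures.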


D. Singing Voice Separation Results 


In this section, we compare various deep learning models 
from several aspects, including the effect of different output 
formats, the effect of different deep recurrent neural network 
structures, and the effect of discriminative training. 

For simplicity, unless mentioned explicitly, we report the
results of deep neural networks with three hidden layers of
1000 hidden units each, trained with the mean squared error
criterion and joint optimization of the masking layer, using
10 K samples as the circular shift step size and features with
a context window size of three frames.

Table I presents the results with different output layer
formats. We compare using a single source as a target (row 1)
and using two sources as targets in the output layer (rows 2 and
3). We observe that modeling two sources simultaneously
provides higher performance in GNSDR, GSIR, and GSAR.
Comparing rows 2 and 3 in Table I, we observe that jointly
optimizing the masking layer and the DNN further improves
GNSDR and GSIR, at the cost of a lower GSAR.

Table II presents the results of different deep recurrent
neural network architectures (DNN, DRNN with different
recurrent connections, and sRNN) with and without discriminative
training. We can observe that discriminative training
further improves GSIR while maintaining similar GNSDR and
GSAR.


Finally, we compare our best results with other previous 
work under the same setting. Table III shows the results with 
unsupervised and supervised settings. Our proposed models 
achieve 2.30-2.48 dB GNSDR gain, 4.32-5.42 dB GSIR gain 
with similar GSAR performance, compared with the RNMF 
model.
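
The gain ranges quoted above follow directly from the Table III entries. The short script below recomputes them; the values are copied from Table III (in dB), while the dictionary layout and the helper `gains_over` are introduced only for this illustration.

```python
# Scores copied from Table III (dB); the data layout is an illustrative choice.
rnmf = {"GNSDR": 4.97, "GSIR": 7.66, "GSAR": 10.03}
ours = {
    "DRNN-2": {"GNSDR": 7.27, "GSIR": 11.98, "GSAR": 9.99},
    "DRNN-2 + discrim": {"GNSDR": 7.45, "GSIR": 13.08, "GSAR": 9.68},
}


def gains_over(baseline, model):
    """Per-metric difference (model minus baseline), rounded to two decimals."""
    return {metric: round(model[metric] - baseline[metric], 2) for metric in baseline}


gains = {name: gains_over(rnmf, scores) for name, scores in ours.items()}
for name, gain in gains.items():
    print(name, gain)
```

The GSAR differences are small and negative for both models, which is what "similar GSAR performance" refers to.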
TABLE II

MIR-1K SEPARATION RESULT COMPARISON FOR THE EFFECT OF
DISCRIMINATIVE TRAINING USING DIFFERENT ARCHITECTURES.
“DISCRIM” DENOTES THE MODELS WITH DISCRIMINATIVE TRAINING.

Model               GNSDR   GSIR    GSAR
DNN                  6.93   10.99   10.15
DRNN-1               7.11   11.74    9.93
DRNN-2               7.27   11.98    9.99
DRNN-3               7.14   11.48   10.15
sRNN                 7.09   11.72    9.88
DNN + discrim        7.09   12.11    9.67
DRNN-1 + discrim     7.21   12.76    9.56
DRNN-2 + discrim     7.45   13.08    9.68
DRNN-3 + discrim     7.09   11.69   10.00
sRNN + discrim       7.15   12.79    9.39


TABLE III

MIR-1K SEPARATION RESULT COMPARISON BETWEEN OUR MODELS AND
PREVIOUSLY PROPOSED APPROACHES. “DISCRIM” DENOTES THE MODELS
WITH DISCRIMINATIVE TRAINING.

Unsupervised
Model               GNSDR   GSIR    GSAR
RPCA                 3.15    4.43   11.09
RPCAh                3.25    4.52   11.10
RPCAh + FASST        3.84    6.22    9.19

Supervised
Model               GNSDR   GSIR    GSAR
MLRR                 3.85    5.63   10.70
RNMF                 4.97    7.66   10.03
DRNN-2               7.27   11.98    9.99
DRNN-2 + discrim     7.45   13.08    9.68


E. Speech Denoising Setting 

We apply the proposed framework to a speech denoising 
task, where one source is the clean speech and the other source 
is the noise. The goal of the task is to separate clean speech 
from noisy speech. In the experiments, we use magnitude 
spectra as input features to the neural network. The spectral 
representation is extracted using a 1024-point STFT with 
50% overlap. Empirically, we found that log-mel filterbank 
features provide worse performance than magnitude spectra. 
Unless mentioned explicitly, we use deep neural networks
with two hidden layers of 1000 hidden units each, trained with
the mean squared error criterion and joint optimization of the
masking layer, using 10 K samples as the circular shift step
size and features with a context window size of one frame.
The model is trained and tested on 0 dB mixtures,
without input normalization.

To understand the effect of degradation in the mismatch 
condition, we set up the experimental recipe as follows. We 
use a hundred utterances spanning ten different speakers from 
the TIMIT database. We also use a set of five noises: Airport, 
Train, Subway, Babble, and Drill. We generate a number of 
noisy speech recordings by selecting random subsets of noises 
and overlaying them with speech signals. We also specify the
signal-to-noise ratio (SNR) when constructing the noisy mixtures.
After we complete the generation of the noisy signals, we 
split them into a training set and a test set. 
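
Constructing a noisy mixture at a specified SNR amounts to rescaling the noise before adding it to the speech. The sketch below illustrates one common way to do this; the function name `mix_at_snr` is hypothetical, and the use of mean power over the whole signal to define the SNR is an assumption rather than a description of the exact recipe used here.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix.

    speech, noise: 1-D time-domain signals of equal length (assumed).
    Returns the mixture and the scaled noise.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Noise power required to reach the requested SNR.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise
```

With `snr_db=0` the speech and the scaled noise have equal power, which corresponds to the 0 dB training condition; other values of `snr_db` give the mismatched test mixtures.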



Fig. 10. Speech denoising architecture comparison (metrics: SDR, SIR, SAR,
STOI), where “+discrim” indicates training with the discriminative objective.
The bars show average values and the vertical lines on the bars denote
minimum and maximum observed values. Models are trained and tested on
0 dB SNR inputs. The average STOI score for unprocessed mixtures is 0.675.


Fig. 11. Performance with unknown gains (metrics: SDR, SIR, SAR, STOI).
Speech denoising using multiple SNR inputs and testing on a model
that is trained on 0 dB SNR, where the bars show average values and the
vertical lines on the bars denote minimum and maximum observed values.
The left/back, middle, right/front bars in each group show the results of NMF,
DNN without joint optimization of the masking layer, and DNN with
joint optimization of the masking layer, respectively. The average STOI scores
for unprocessed mixtures at -18 dB, -12 dB, -6 dB, 0 dB, 6 dB, 12 dB, and 20
dB SNR are 0.370, 0.450, 0.563, 0.693, 0.815, 0.903, and 0.968, respectively.


F. Speech Denoising Results 

In the following experiments, we examine the effect of the 
proposed methods under various scenarios. We first evaluate 
various architectures using 0 dB SNR inputs, as shown in 
Figure 10. We can observe that the recurrent neural network
architectures (DRNN-1, DRNN-2, sRNN) achieve similar 
performance compared to the DNN model. Including the 
discriminative training objective improves SDR and SIR, but 
results in slightly degraded SAR and similar STOI values. 

Fig. 12. Speech denoising experimental results comparison between NMF, DNN without joint optimization of the masking layer, and DNN with joint
optimization of the masking layer, given 0 dB SNR inputs, when used on data that is not represented in training. The bars show average values and the vertical
lines on the bars denote minimum and maximum observed values. We show the separation results of (a) known speakers and noise, (b) unseen speakers, (c)
unseen noise, and (d) unseen speakers and noise. The average STOI scores for unprocessed mixtures for cases (a), (b), (c), and (d) are 0.698, 0.686, 0.705,
and 0.628, respectively.


To further evaluate the robustness of the model, we examine
our model under a variety of situations in which it is presented
with unseen data, such as unseen SNRs, speakers, and noise
types. These tests provide a way of understanding the performance
of the proposed approach under mismatched conditions.
In Figure 11, we show the robustness of this model under
various SNRs. The model is trained on 0 dB SNR mixtures
and is evaluated on mixtures ranging from 20 dB SNR to -18
dB SNR. We compare the results between NMF, DNN without
joint optimization of the masking layer, and DNN with joint
optimization of the masking layer. In most cases, DNN with
joint optimization achieves the best results, especially under
low SNR inputs. For the 20 dB SNR case, NMF achieves the
best performance. DNN without joint optimization achieves
the highest SIR given high SNR inputs, though SDR, SAR,
and STOI are lower than the DNN with joint optimization.
Note that in our approach, joint optimization of the
time-frequency masks and DNNs can be viewed as a way to directly
incorporate the FFT-MASK targets into the DNNs for
both speech and noise; FFT-MASK has been reported to
achieve better performance compared to other
targets in speech denoising tasks.
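
The masking layer referred to throughout this section can be illustrated with a short NumPy sketch. It assumes a ratio-style soft mask computed from the two network outputs and applied to the mixture magnitude spectrum; the function name `soft_mask_layer`, the argument names, and the small constant `eps` are assumptions introduced for this example, and the precise layer definition is the one given earlier in the paper.

```python
import numpy as np


def soft_mask_layer(y1_hat, y2_hat, mixture_mag, eps=1e-12):
    """Apply a soft time-frequency mask built from two network outputs (sketch).

    y1_hat, y2_hat: raw network outputs for the two sources.
    mixture_mag:    magnitude spectrum of the mixture.
    eps:            small constant to avoid division by zero (assumption).
    """
    total = np.abs(y1_hat) + np.abs(y2_hat) + eps
    mask1 = np.abs(y1_hat) / total
    s1 = mask1 * mixture_mag
    # Using the complementary mask makes the two estimates sum to the mixture.
    s2 = (1.0 - mask1) * mixture_mag
    return s1, s2
```

Because the two masks sum to one in every time-frequency bin, the separated spectra always add up to the mixture spectrum. When this layer is part of the network, the training error is measured on `s1` and `s2` rather than on the raw outputs, which is what distinguishes joint optimization from applying the mask as a separate post-processing step.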

Next, we evaluate the models under three different cases: (1) 
the testing noise is unseen in training, (2) the testing speaker is 
unseen in training, and (3) both the testing noise and testing
speaker are unseen in training. For the unseen noise
case, we train the model on mixtures with Babble, Airport, 
Train and Subway noises, and evaluate it on mixtures that 
include a Drill noise (which is significantly different from the 
training noises in both spectral and temporal structure). For 
the unknown speaker case, we hold out some of the speakers 
from the training data. For the case where both the noise and 
speaker are unseen, we use the combination of the above. 


We compare our proposed approach with the NMF model
and DNN without joint optimization of the masking layer.
The models are trained and tested on 0 dB SNR inputs,
and these experimental results are shown in Figure 12. For
the unknown speaker case, as shown in Figure 12(b), we
observe that there is only a mild degradation in performance
for all models compared to the case where the speakers are
known in Figure 12(a). The results suggest that the approaches
can be easily used in speaker variant situations. In Figure 12(c),
with the unseen noise, we observe a larger degradation
in results, which is expected due to the drastically different
nature of the noise type. For the case where both the noise
and speakers are unknown, as shown in Figure 12(d), all three
models achieve the worst performance compared to the other
cases. Overall, the proposed approach generalizes well across
speakers and achieves higher source separation performance,
especially in SDRs, compared to the baseline models under
various conditions.

G. Discussion 

Throughout the experiments in speech separation, singing
voice separation, and speech denoising tasks, we have seen
significant improvement over the baseline models under various
settings, by the use of joint optimization of time-frequency
masks with deep recurrent neural networks and
the discriminative training objective. By jointly optimizing
time-frequency masks with deep recurrent neural networks,
the proposed end-to-end system outperforms baseline models
(such as NMF and DNN models without joint optimization) in
matched and mismatched conditions. Given that audio signals are
time series in nature, we explore various recurrent neural
network architectures to capture temporal information and
further enhance performance. Though there are extra memory
and computational costs compared to feed-forward neural
networks, DRNNs achieve extra gains, especially in the speech
separation (0.5 dB SDR gain) and singing voice separation
(0.34 dB GNSDR gain) tasks. Similar observations can be
found in related work using LSTM models, where
the authors observe significant improvements using recurrent
neural networks compared with DNN models. Our proposed
discriminative objective can be viewed as a regularization
term added to the original mean-squared error objective. By
enforcing the similarity between targets and predictions of the
same source and dissimilarity between targets and predictions
of competing sources, we observe that interference is further
reduced while maintaining similar or higher SDRs and SARs.

V. Conclusion and Future Work

In this paper, we explore various deep learning architectures,
including deep neural networks and deep recurrent neural
networks, for monaural source separation problems. We enhance
the performance by jointly optimizing a soft time-frequency
mask layer with the networks in an end-to-end
fashion and exploring a discriminative training criterion. We
evaluate our proposed method for speech separation, singing
voice separation, and speech denoising tasks. Overall, our
proposed models achieve 2.30-4.98 dB SDR gain compared to
the NMF baseline, while maintaining higher SIRs and SARs
in the TSP speech separation task. In the MIR-1K singing
voice separation task, our proposed models achieve 2.30-2.48
dB GNSDR gain and 4.32-5.42 dB GSIR gain, compared to
the previously proposed methods, while maintaining similar
GSARs. Moreover, our proposed method also outperforms
NMF and DNN baselines in various mismatch conditions
in the TIMIT speech denoising task. To further improve
the performance, one direction is to further explore using
LSTMs to model longer temporal information [40], which have
shown great performance compared to conventional recurrent
neural networks because LSTMs mitigate the vanishing
gradient problem. In addition, our proposed models can also
be applied to many other applications such as robust ASR.